The Arm Statistical Profiling Extension (SPE) is an architectural feature designed for enhanced instruction execution profiling within Arm CPUs. It has been available since the introduction of the Neoverse N1 CPU platform in 2019, alongside the performance monitor units (PMUs) generally available in Arm CPUs. Extracting value from capabilities like SPE and PMUs depends on tooling, documentation, and examples that together form a top-down solution for SoC telemetry. Six engineers at Arm recently published a detailed white paper on the use of SPE for performance analysis, and their approach and findings are summarized here. This blog post introduces the use of SPE for performance analysis and root cause analysis, and is aimed at software developers, performance analysts, and silicon engineers.
Arm SPE is a hardware-assisted CPU profiling mechanism that offers detailed profiling capabilities. It records key execution data, including program counters, data addresses, and PMU events. SPE enhances performance analysis for branches, memory access, and more, making it useful for software optimization. SPE data can be applied for precise sampling in source code hotspot detection, memory access analysis, and data sharing analysis using tools like the Linux perf tool. SPE sampling involves four stages: statistical selection of operations, recording key execution information, post-filtering of sample records, and storing records in memory. It enables efficient profiling and data extraction using monitoring tools. SPE uses a down counter to periodically select micro-operations for profiling. SPE sample records capture the execution lifecycle of an operation, starting at the CPU backend.
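The four sampling stages described above can be illustrated with a small software model. This is only a sketch for intuition, not how the hardware is implemented: the function names, the record fields (`pc`, `latency`), and the `min_latency` filter are assumptions chosen for the example.

```python
def select_operations(num_ops, interval):
    """Stage 1: a down counter selects one operation every `interval` operations."""
    selected = []
    counter = interval
    for op_index in range(num_ops):
        counter -= 1
        if counter == 0:
            selected.append(op_index)
            counter = interval  # reload the counter after each selection
    return selected


def profile(ops, interval, min_latency=0):
    """Run the four stages over a list of operations (hypothetical dict records)."""
    buffer = []
    for idx in select_operations(len(ops), interval):
        record = {"index": idx, **ops[idx]}      # stage 2: record execution info
        if record["latency"] >= min_latency:     # stage 3: post-filter the record
            buffer.append(record)                # stage 4: store it in memory
    return buffer


# Ten made-up operations whose latency grows with their position.
ops = [{"pc": 0x1000 + 4 * i, "latency": i} for i in range(10)]
for rec in profile(ops, interval=3, min_latency=5):
    print(rec)
```

With an interval of 3, operations 2, 5, and 8 are selected, and the latency filter then discards the record for operation 2. A larger interval lowers the profiling overhead at the cost of fewer samples.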
SPE profiling can be enabled using the Linux perf tool for data collection. Arm has also released a helper tool called SPE-parser to support the collection and analysis of SPE traces. It can export data to both CSV and Parquet formats, which gives additional flexibility for analysis.
There are four case studies in the published SPE whitepaper that help illustrate the capabilities of this feature.
In the first case study, SPE was used to optimize the Apache Arrow CSV writer code, resulting in a performance improvement of up to 40% on a Neoverse N1 platform. The case study began by measuring Instructions Per Cycle (IPC) and bandwidth in GB/s, then examined misses per kilo-instruction (MPKI) and miss ratios. Next, operation mix metrics revealed the CSV writer workload's heavy reliance on integer instructions and branches, indicating potential vectorization opportunities. Profiling for L1D cache events and branch mispredictions exposed issues in the memcpy function, which experienced frequent cache misses and branch mispredictions. The case study then analyzed the branches within memcpy and pointed to an inefficient buffer size as the source of the mispredictions. It also highlights how interleaving memcpy operations for data fields and delimiters in the Arrow CSV writer contributes to suboptimal CPU branch prediction.
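The metrics named in this case study are simple ratios of event counts. The sketch below shows how they are commonly derived; the counter values are made up for illustration and are not taken from the whitepaper.

```python
def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def mpki(misses, instructions):
    """Misses per kilo-instruction (per 1,000 instructions retired)."""
    return misses * 1000 / instructions


def miss_ratio(misses, accesses):
    """Fraction of accesses (or branches) that missed (or were mispredicted)."""
    return misses / accesses


# Hypothetical event counts, NOT measurements from the case study.
counts = {
    "instructions": 1_000_000,
    "cycles": 500_000,
    "l1d_accesses": 300_000,
    "l1d_misses": 6_000,
}

print("IPC:", ipc(counts["instructions"], counts["cycles"]))
print("L1D MPKI:", mpki(counts["l1d_misses"], counts["instructions"]))
print("L1D miss ratio:", miss_ratio(counts["l1d_misses"], counts["l1d_accesses"]))
```

MPKI normalizes misses by the amount of work done, while the miss ratio normalizes by the number of accesses, so the two can rank the same functions differently.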
After performing this analysis using SPE, the real fun began: optimizing the CSV writer's code. The first stage was to introduce a helper function called copy_separator to improve the copying of delimiter and end-of-line characters within the hot loop. Benchmarking the optimized code showed a throughput increase from between 1.5 and 1.8 GB/s to 2.1 GB/s, a reduction in total instructions, and an increase in Instructions Per Cycle (IPC) from 2.22 to 2.58. Branch-related metrics were significantly reduced, contributing to an overall performance uplift. We recommend using the Statistical Profiling Extension (SPE) for hotspot analysis and root cause detection when optimizing code on Arm Neoverse cores.
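The relative gains implied by these figures can be checked with a little arithmetic. The helper name `uplift_percent` is just an illustrative choice; the input numbers are the ones quoted above.

```python
def uplift_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


# Throughput: the baseline was reported as a range of 1.5 to 1.8 GB/s.
print(round(uplift_percent(1.5, 2.1), 1))    # 40.0
print(round(uplift_percent(1.8, 2.1), 1))    # 16.7

# IPC: 2.22 before, 2.58 after.
print(round(uplift_percent(2.22, 2.58), 1))  # 16.2
```

The 40% figure therefore corresponds to the low end of the baseline throughput range; measured against the high end, the gain is closer to 17%, in line with the IPC increase of roughly 16%.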
You can find the upstream pull request for this optimization on GitHub: https://github.com/apache/arrow/pull/13394.
SPE-based profiling can provide valuable insights into memory operations, including memory latency, execution latency, and data source information. This analysis can help identify bottlenecks and performance issues related to memory access, and it can capture some of the data that typically requires an LMBench run. This is possible because of the information SPE records for each sampled memory operation.
SPE records hierarchical data source hits for memory loads, and the data source encoding depends on the system's cache hierarchy, such as L1 data cache, L2 cache, LLC, or RAM.
Analyzing memory access using SPE-profiled data involves filtering records to focus on the benchmark portion of the code, deriving latency values, and examining PMU events triggered by memory operations. The memory use case example shows that SPE-derived latencies are close to the LMBench-reported latencies. SPE data can also help analyze performance issues, such as TLB misses. The data source information derived from SPE matches well with the hierarchical memory access data sources. It can help identify where memory accesses hit in the cache hierarchy.
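The filter-then-derive workflow described above can be sketched as follows. The record layout (`pc`, `op`, `data_source`, `latency`) and the sample values are hypothetical stand-ins for decoded SPE records, not the actual output schema of perf or SPE-parser.

```python
from collections import defaultdict
from statistics import median


def latency_by_source(records, pc_range):
    """Median load latency per data source, limited to PCs inside `pc_range`."""
    lo, hi = pc_range
    groups = defaultdict(list)
    for r in records:
        # Keep only loads issued from the benchmark portion of the code.
        if r["op"] == "load" and lo <= r["pc"] < hi:
            groups[r["data_source"]].append(r["latency"])
    return {source: median(values) for source, values in groups.items()}


# Made-up records; latencies are in cycles.
records = [
    {"pc": 0x4000, "op": "load", "data_source": "L1D", "latency": 4},
    {"pc": 0x4004, "op": "load", "data_source": "L1D", "latency": 6},
    {"pc": 0x4008, "op": "load", "data_source": "DRAM", "latency": 310},
    {"pc": 0x4008, "op": "store", "data_source": "L1D", "latency": 3},
    {"pc": 0x9000, "op": "load", "data_source": "L2", "latency": 15},  # outside range
]

print(latency_by_source(records, pc_range=(0x4000, 0x5000)))
```

Using the median rather than the mean is a design choice for this sketch: a handful of very slow samples, for example loads that also took a TLB miss, would otherwise skew the per-level latency.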
SPE profiling can also be used to estimate memory bandwidth, although it is important to note that SPE is a statistical measurement tool based on sampled operations and is not highly accurate for memory bandwidth measurements. It is nonetheless useful for relative measurements during optimization exercises and sensitivity studies, particularly for code with predictable and well-known memory access patterns, such as micro-kernels. The SPE-parser tool mentioned earlier is used to process the raw SPE-profiled data collected with the Linux perf tool, and its output provides the details needed for the estimate. The memory read bandwidth is estimated statistically from the filtered SPE samples: the total memory size read by the benchmark is divided by the total execution time.
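One plausible way to turn sampled loads into a bandwidth estimate is sketched below. It assumes that each sample stands in for `sampling_interval` operations of the same kind and size, which is a simplification made for this example and not necessarily the exact method used in the whitepaper; the function and parameter names are likewise illustrative.

```python
def estimate_read_bandwidth(sampled_load_sizes, sampling_interval, elapsed_seconds):
    """Estimate read bandwidth in bytes per second from sampled load sizes.

    Assumption: every sampled load represents `sampling_interval` loads
    of the same size, so sampled bytes are scaled up by the interval.
    """
    estimated_bytes = sum(sampled_load_sizes) * sampling_interval
    return estimated_bytes / elapsed_seconds


# Made-up example: 1,000 sampled 8-byte loads, one sample per 1,024 operations,
# over a benchmark region that ran for half a second.
bandwidth = estimate_read_bandwidth([8] * 1000, sampling_interval=1024,
                                    elapsed_seconds=0.5)
print(f"{bandwidth / 1e6:.1f} MB/s")
```

Because the result scales directly with the number of samples, it is best used to compare two runs of the same kernel rather than as an absolute bandwidth figure.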
A final use case to highlight is SPE profiling for data sharing analysis in multi-threaded workloads. Data sharing issues can lead to performance problems, particularly when multiple threads work on the same data set, causing cache coherency overhead. False sharing is a common performance problem. It occurs when one processor modifies data items on a cache line while another processor works on different parts of that cache line. False sharing can lead to cache invalidation and reduced performance.
The Linux perf c2c tool analyzes memory access data obtained from SPE, including data source information, data addresses, and instruction PC addresses. Perf c2c helps detect false sharing by reporting the cache line addresses with potential problems, the data offsets within those cache lines accessed by different threads, the instruction addresses involved, whether accesses were local or remote (cross-socket), and the NUMA nodes involved. With this information, developers can locate and address false sharing problems. SPE profiling, combined with the perf c2c tool, can therefore be used to identify and resolve false sharing issues in multi-threaded applications, ultimately improving performance.
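The idea of grouping sampled accesses by cache line and comparing the offsets touched by each thread can be sketched as follows. This is a simplified heuristic for illustration only and is not the algorithm perf c2c implements; the 64-byte line size and the record fields (`tid`, `addr`, `op`) are assumptions.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def find_false_sharing(samples):
    """Return cache lines written to while threads touch disjoint offsets."""
    lines = defaultdict(list)
    for s in samples:
        line_addr = s["addr"] & ~(CACHE_LINE_BYTES - 1)
        lines[line_addr].append(s)

    suspects = {}
    for line_addr, accesses in lines.items():
        offsets = defaultdict(set)  # thread id -> offsets touched in this line
        for s in accesses:
            offsets[s["tid"]].add(s["addr"] - line_addr)
        has_store = any(s["op"] == "store" for s in accesses)
        if len(offsets) < 2 or not has_store:
            continue  # a single thread or read-only data cannot falsely share
        tids = list(offsets)
        disjoint = all(
            offsets[a].isdisjoint(offsets[b])
            for i, a in enumerate(tids)
            for b in tids[i + 1:]
        )
        if disjoint:
            suspects[line_addr] = {tid: sorted(o) for tid, o in offsets.items()}
    return suspects


# Made-up samples: threads 1 and 2 use different fields of the line at 0x1000,
# while both threads access the same word at 0x2000 (true sharing).
samples = [
    {"tid": 1, "addr": 0x1000, "op": "store"},
    {"tid": 2, "addr": 0x1008, "op": "load"},
    {"tid": 1, "addr": 0x2000, "op": "store"},
    {"tid": 2, "addr": 0x2000, "op": "load"},
]
for line_addr, per_thread in find_false_sharing(samples).items():
    print(hex(line_addr), per_thread)
```

Only the line at 0x1000 is reported, because the two threads touch different offsets of a line that is being written. A typical fix is to pad or align the affected fields so that each thread's data sits on its own cache line.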
The Arm Statistical Profiling Extension (SPE) is a powerful feature that provides detailed CPU profiling capabilities, with uses including general code optimization, memory access analysis, memory latency estimation, and data sharing analysis. This blog post introduces SPE and its utility for software developers, performance analysts, and silicon engineers. For details on the material covered here, see the SPE performance methodology whitepaper published by Arm.
SPE enables precise sampling for source code hotspot detection, memory access analysis, and data sharing issues. It enhances performance analysis by recording key execution data, and its integration with tools like Linux perf empowers users to analyze and optimize their code effectively. This comprehensive toolset enables users to identify bottlenecks, improve software performance, and enhance their understanding of CPU behavior.
SPE-parser tool: https://gitlab.arm.com/telemetry-solution/telemetry-solution/-/tree/main/tools/spe_parser?ref_type=heads